STA 360 Lab 4: Introduction to Hamiltonian Monte Carlo
Overview
By now you have encountered at least two procedures for generating samples from a distribution over model parameters conditional on observed data (the posterior distribution of your model parameters). The first method, direct sampling, assumes that we can explicitly write down the form of the posterior density and/or distribution function, which means that we can generate samples from it with calls to standard R functions. The second method, Gibbs sampling, generates samples from a posterior distribution by building a Markov Chain whose stationary distribution is the desired posterior.
The benefit of Gibbs sampling is that it allows us to sample from distributions whose form we may not be able to write down. The drawback is that, unlike direct draws from the posterior, samples from the Gibbs sampler only eventually behave as if they were drawn from the posterior distribution: now we have to be concerned about exactly how and when this convergence occurs. Later in the course you will learn about diagnostics for Gibbs samplers and some methods for monitoring convergence of Markov chains. You will write your own Gibbs samplers and implement some of these methods. You will also learn that the Gibbs sampler is just a special case of a class of procedures for sampling from a posterior distribution, which fall under the title Markov Chain Monte Carlo (MCMC).
Before you see those details, we will spend time in this lab building intuition for the mechanics of Markov Chain Monte Carlo methods and we will use that intuition to understand the advanced methods Stan is running under the hood to generate its samples.
MCMC and Gibbs sampling
One of the key intuitions to build when thinking about MCMC methods is that of “space exploration.” In the case of two-parameter models parametrized by \(\theta = [\theta_1, \theta_2] \in \mathbb{R}^2\), we can think of MCMC methods as sampling from the 2-d real-valued plane according to our target density
\[ p(\theta_1, \theta_2|X_1, \dots, X_n) \]
MCMC methods like the Gibbs sampler will move iteratively through the plane, transitioning from point to point and producing a set of \(S\) samples
\[ [\theta^{(1)}_1, \theta^{(1)}_2], \dots, [\theta^{(S)}_1, \theta^{(S)}_2] \]
When the target density is nicely behaved, the exploration of a Gibbs sampler is also “nice.” Regions of high density are visited with higher frequency than regions of lower density. All regions of the plane are both in theory and in practice accessible with only a few transition steps. An example of a nice density on \(\mathbb{R}^2\) is a bivariate normal with a modest degree of dependence between its two components \(\theta_1, \theta_2\):
library(ggplot2)   # for ggplot, geom_tile, stat_contour, etc.
library(magrittr)  # for the %>% pipe used throughout this lab
theta_1 <- seq(-3, 3, length.out = 250)
theta_2 <- seq(-3, 3, length.out = 250)
expand.grid(theta_1 = theta_1, theta_2 = theta_2) %>%
ggplot2::ggplot() +
geom_tile(aes(x = theta_1, y = theta_2, fill = dnorm(theta_1)*dnorm(theta_2)*(exp(0.25*theta_1*theta_2)))) +
stat_contour(aes(x = theta_1, y = theta_2, z = dnorm(theta_1)*dnorm(theta_2)*(exp(0.25*theta_1*theta_2)))) +
theme_minimal() +
scale_fill_distiller(palette = "Spectral", name = "Density") +
labs(x = expression(theta[1]), y = expression(theta[2])) +
coord_fixed()
What’s an example of a “not nice” density for moving through the plane with Gibbs transition steps? Let’s look at an example of a bivariate density that would likely give the Gibbs sampler some trouble:
theta_1 <- seq(-3, 3, length.out = 250)
theta_2 <- seq(-3, 3, length.out = 250)
expand.grid(theta_1 = theta_1, theta_2 = theta_2) %>%
ggplot2::ggplot() +
geom_tile(aes(x = theta_1, y = theta_2,
fill = dnorm(0, mean = (theta_1^2 + theta_2^2) + theta_1/(theta_1^2 + theta_2^2)))) +
stat_contour(aes(x = theta_1, y = theta_2, z = dnorm(0, mean = (theta_1^2 + theta_2^2) + theta_1/(theta_1^2 + theta_2^2)))) +
theme_minimal() +
scale_fill_distiller(palette = "Spectral", name = "Density") +
labs(x = expression(theta[1]), y = expression(theta[2])) +
coord_fixed()
The problem here is that there are regions of very high density (the red peaks in the middle) directly adjacent to regions of very low density. It will take many transition steps for the Markov chain to reach the peaks, and many more transitions to escape the neighborhood of a peak once the chain moves near it.
While the asymptotic guarantees for the Markov chain still hold, this sticking behavior makes it difficult to get accurate posterior summaries from any practical number of samples. As with many other problems in statistics, the issue typically gets worse as the dimension of the parameter space increases.
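To see how much correlation alone can slow down a Gibbs sampler, here is a toy sketch (illustrative code, not part of the lab's required exercises). For a standard bivariate normal target with correlation \(\rho\), alternating full-conditional updates give each component's chain a lag-one autocorrelation of \(\rho^2\), so the chain needs on the order of \(1/(1-\rho^2)\) transitions to produce an effectively independent draw:

```r
set.seed(1)
rho <- 0.99   # strong correlation between the two components
S <- 20000    # number of Gibbs iterations

theta_1 <- numeric(S)
theta_2 <- numeric(S)
for (s in 2:S) {
  # Full conditionals of a standard bivariate normal with correlation rho
  theta_1[s] <- rnorm(1, mean = rho * theta_2[s - 1], sd = sqrt(1 - rho^2))
  theta_2[s] <- rnorm(1, mean = rho * theta_1[s],     sd = sqrt(1 - rho^2))
}

# Lag-one autocorrelation of the theta_1 chain; theory says it equals rho^2
acf(theta_1, lag.max = 1, plot = FALSE)$acf[2]
```

With \(\rho = 0.99\) the lag-one autocorrelation is near \(\rho^2 \approx 0.98\), i.e. roughly a hundred iterations per effectively independent draw.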
Hamiltonian Monte Carlo (HMC)
Even less extreme examples than the one shown above will cause Gibbs samplers to explore the sample space more slowly than we’d like. Let’s look at a simpler case and write the code for our own Gibbs sampler. Later we will compare it to Stan’s behavior on the same problem and discuss the benefits of using HMC.
Consider data generated from a bivariate normal distribution with mean parameters \(\theta_1, \theta_2\) and known covariance matrix \(\Sigma\) and suppose we place independent normal priors on \(\theta_1, \theta_2\):
\[ \begin{aligned} X_1, \dots, X_n &\sim N \left(\theta, \Sigma \right) \\ \theta_j &\sim N(0, 1)~~~~~~j=1,2 \end{aligned} \]
Suppose too that the covariance matrix \(\Sigma\) has a specific form with unit-variance marginal distributions and known correlation parameter \(\rho\):
\[ \Sigma = \left[\begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array}\right] \]
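A standard multivariate normal fact (stated here without derivation) is useful for the questions below: if \([Z_1, Z_2]^\top \sim N(\mu, \Sigma)\) with \(\Sigma\) as above, then
\[ Z_1 \mid Z_2 = z_2 \sim N\left(\mu_1 + \rho(z_2 - \mu_2),\; 1 - \rho^2\right) \]
and symmetrically for \(Z_2 \mid Z_1\).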
In this course you have derived (or will derive) the multivariate posterior density \(p(\theta | \Sigma, X)\). However, for the purposes of this exercise, we will sample from the full conditional densities
\[ p(\theta_1 | X_1, \dots, X_n, \rho, \theta_2) \\ p(\theta_2 | X_1, \dots, X_n, \rho, \theta_1) \]
Before building the Gibbs sampler to make inferences on \(\theta_1, \theta_2\), first answer these questions:
- What is the conditional density \(p(\theta_1 | X_1, \dots, X_n, \rho, \theta_2)\)?
- What is the conditional density \(p(\theta_2 | X_1, \dots, X_n, \rho, \theta_1)\)?
Now that you have derived the full conditional densities, write some code to implement a Gibbs sampler for this model.
normal_gibbs_sampler <- function(S, X, rho){
  # S:   number of Gibbs iterations
  # X:   n x 2 data matrix
  # rho: known correlation parameter
  theta_1 <- rep(0, S)
  theta_2 <- rep(0, S)
  n <- nrow(X)
  for(s in 2:S){
    # Draw theta_1 from its full conditional given the most recent theta_2
    theta_1[s] <- rnorm(1,
                        mean = mean(X[,1]) + rho*(theta_2[s-1] - mean(X[,2])),
                        sd = sqrt((1 - rho^2)/n))
    # Draw theta_2 from its full conditional given the freshly drawn theta_1
    theta_2[s] <- rnorm(1,
                        mean = mean(X[,2]) + rho*(theta_1[s] - mean(X[,1])),
                        sd = sqrt((1 - rho^2)/n))
  }
  return(cbind(theta_1, theta_2))
}
With the Gibbs sampling code in hand, let’s generate samples from the posterior distribution of \(\theta_1, \theta_2\) with \(\rho = 0.2\). We’ll do the same using Stan.
n <- 100
rho <- 0.2
X <- MASS::mvrnorm(n = n, mu = c(2, 4), Sigma = matrix(c(1, rho, rho, 1), nrow = 2))
norm_gibbs_samps <- normal_gibbs_sampler(600, X, rho)
#
true_post <- MASS::mvrnorm(n = 100000,
mu = colMeans(X),
Sigma = matrix(c(1, rho, rho, 1) / nrow(X), nrow = 2))
data.frame(norm_gibbs_samps) %>%
magrittr::set_colnames(c("theta_1", "theta_2")) %>%
dplyr::mutate(iter = 1:n()) %>%
dplyr::filter(iter > 100) %>%
dplyr::mutate(iter = 1:n()) %>%
ggplot2::ggplot() +
geom_density2d(data = data.frame(true_post) %>%
magrittr::set_colnames(c("true_1", "true_2")),
aes(x = true_1, y = true_2)) +
geom_path(aes(x = theta_1, y = theta_2, colour = iter), alpha = 0.2, size = 0.5) +
geom_point(aes(x = theta_1, y = theta_2, colour = iter), size = 0.5) +
scale_color_distiller(palette = "Spectral", name = "Iter") +
labs(x = expression(theta[1]), y = expression(theta[2]))
stan_res <- rstan::stan("hmc_norm_example.stan", data = list(X = X,
N = nrow(X),
Sigma = matrix(c(1, rho, rho, 1), nrow = 2)),
chains = 1, iter = 600, warmup = 100, verbose = F, refresh = 0) %>%
rstan::extract()
data.frame(stan_res$theta) %>%
magrittr::set_colnames(c("theta_1", "theta_2")) %>%
dplyr::mutate(iter = 1:n()) %>%
ggplot2::ggplot() +
geom_density2d(data = data.frame(true_post) %>%
magrittr::set_colnames(c("true_1", "true_2")),
aes(x = true_1, y = true_2)) +
geom_path(aes(x = theta_1, y = theta_2, colour = iter), alpha = 0.2, size = 0.5) +
geom_point(aes(x = theta_1, y = theta_2, colour = iter), size = 0.5) +
scale_color_distiller(palette = "Spectral", name = "Iter") +
labs(x = expression(theta[1]), y = expression(theta[2]))
The Gibbs sampling results and the HMC results look pretty similar! What happens when \(\rho = 0.995\)?
n <- 100
rho <- 0.995
X <- MASS::mvrnorm(n = n, mu = c(2, 4), Sigma = matrix(c(1, rho, rho, 1), nrow = 2))
norm_gibbs_samps <- normal_gibbs_sampler(600, X, rho)
#
true_post <- MASS::mvrnorm(n = 100000,
mu = colMeans(X),
Sigma = matrix(c(1, rho, rho, 1) / nrow(X), nrow = 2))
data.frame(norm_gibbs_samps) %>%
magrittr::set_colnames(c("theta_1", "theta_2")) %>%
dplyr::mutate(iter = 1:n()) %>%
dplyr::filter(iter > 100) %>%
dplyr::mutate(iter = 1:n()) %>%
ggplot2::ggplot() +
geom_density2d(data = data.frame(true_post) %>%
magrittr::set_colnames(c("true_1", "true_2")),
aes(x = true_1, y = true_2)) +
geom_path(aes(x = theta_1, y = theta_2, colour = iter), alpha = 0.2, size = 0.5) +
geom_point(aes(x = theta_1, y = theta_2, colour = iter), size = 0.5) +
scale_color_distiller(palette = "Spectral", name = "Iter") +
labs(x = expression(theta[1]), y = expression(theta[2])) +
xlim(c(1.5, 2.25)) +
ylim(c(3.5, 4.25))
stan_res <- rstan::stan("hmc_norm_example.stan", data = list(X = X,
N = nrow(X),
Sigma = matrix(c(1, rho, rho, 1), nrow = 2)),
chains = 1, iter = 600, warmup = 100, verbose = F, refresh = 0) %>%
rstan::extract()
data.frame(stan_res$theta) %>%
magrittr::set_colnames(c("theta_1", "theta_2")) %>%
dplyr::mutate(iter = 1:n()) %>%
ggplot2::ggplot() +
geom_density2d(data = data.frame(true_post) %>%
magrittr::set_colnames(c("true_1", "true_2")),
aes(x = true_1, y = true_2)) +
geom_path(aes(x = theta_1, y = theta_2, colour = iter), alpha = 0.2, size = 0.5) +
geom_point(aes(x = theta_1, y = theta_2, colour = iter), size = 0.5) +
scale_color_distiller(palette = "Spectral", name = "Iter") +
labs(x = expression(theta[1]), y = expression(theta[2])) +
xlim(c(1.5, 2.25)) +
ylim(c(3.5, 4.25))
Please answer these questions:
- How do the results of the Gibbs sampler differ from those obtained from HMC?
- Why do the samples from the Gibbs sampler exhibit this behavior?
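One way to make the comparison quantitative (a sketch using only base R plus MASS; the seed, chain length, and burn-in are illustrative choices, not part of the lab) is to estimate the lag-one autocorrelation \(r\) of the Gibbs chain at \(\rho = 0.995\) and the rough effective sample size \(S(1-r)/(1+r)\):

```r
set.seed(360)
library(MASS)  # for mvrnorm

n <- 100
rho <- 0.995
X <- mvrnorm(n = n, mu = c(2, 4), Sigma = matrix(c(1, rho, rho, 1), nrow = 2))

# Re-run the lab's Gibbs updates so this chunk stands alone
S <- 5000
theta_1 <- numeric(S)
theta_2 <- numeric(S)
for (s in 2:S) {
  theta_1[s] <- rnorm(1, mean(X[, 1]) + rho * (theta_2[s - 1] - mean(X[, 2])),
                      sd = sqrt((1 - rho^2) / n))
  theta_2[s] <- rnorm(1, mean(X[, 2]) + rho * (theta_1[s] - mean(X[, 1])),
                      sd = sqrt((1 - rho^2) / n))
}
theta_1 <- theta_1[-(1:1000)]  # discard burn-in

r <- acf(theta_1, lag.max = 1, plot = FALSE)$acf[2]
ess <- length(theta_1) * (1 - r) / (1 + r)
# When rho is near 1, ess is a tiny fraction of the retained draws;
# an HMC chain on the same posterior typically achieves far higher ESS.
c(lag1_autocorrelation = r, approx_ess = ess)
```

For a chain whose lag-one autocorrelation is near \(\rho^2 \approx 0.99\), thousands of Gibbs draws carry the information of only a few dozen independent samples, which is exactly the behavior you should see in the trace plots above.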